Analysis of cardiac single-cell RNA-sequencing data can be improved by the use of artificial-intelligence-based tools

Single-cell RNA sequencing (scRNAseq) enables researchers to identify and characterize populations and subpopulations of different cell types in hearts recovering from myocardial infarction (MI) by profiling the transcriptomes of thousands of individual cells. However, the currently available tools for processing and interpreting these immense datasets are of limited effectiveness. We incorporated three artificial intelligence (AI) techniques into a toolkit for evaluating scRNAseq data: AI Autoencoding separates data from different cell types and subpopulations of cell types (cluster analysis); AI Sparse Modeling identifies genes and signaling mechanisms that are differentially activated between subpopulations (pathway/gene set enrichment analysis); and AI Semisupervised Learning tracks the transformation of cells from one subpopulation into another (trajectory analysis). Autoencoding is often used for data denoising; in our pipeline, however, it was used exclusively for cell embedding and clustering. The performance of our AI scRNAseq toolkit and of other highly cited non-AI tools was evaluated with three scRNAseq datasets obtained from the Gene Expression Omnibus database. The Autoencoder was the only tool to identify differences between the cardiomyocyte subpopulations found in mice that underwent MI or sham-MI surgery on postnatal day (P) 1. Statistically significant differences between cardiomyocytes from P1-MI mice and mice that underwent MI on P8 were identified for six cell-cycle phases and five signaling pathways when the data were analyzed via Sparse Modeling, compared with just one cell-cycle phase and one pathway when the data were analyzed with non-AI techniques. Only Semisupervised Learning detected trajectories between the predominant cardiomyocyte clusters in hearts collected on P28 from pigs that underwent apical resection (AR) on P1, and on P30 from pigs that underwent AR on P1 and MI on P28.
In another dataset, the pig scRNAseq data were collected after the injection of CCND2-overexpressing human induced pluripotent stem cell-derived cardiomyocytes (CCND2hiPSC) into injured P28 pig hearts; only the AI-based technique could demonstrate that the host cardiomyocytes increase proliferation through the HIPPO/YAP and MAPK signaling pathways. For the cluster, pathway/gene set enrichment, and trajectory analysis of scRNAseq datasets generated from studies of myocardial regeneration in mice and pigs, our AI-based toolkit identified results that non-AI techniques did not discover. These additional results were validated and were important in explaining myocardial regeneration.
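To make the idea of using an autoencoder for cell embedding (rather than denoising) concrete, the following is a minimal sketch, not the architecture used in this study. It assumes a purely linear encoder and decoder, full-batch gradient descent, and a toy dataset of two hypothetical cell types over four genes; the function names, learning rate, and latent size are illustrative assumptions. A practical pipeline would use nonlinear layers and thousands of genes.

```python
import random


def train_linear_autoencoder(cells, latent_dim=2, epochs=300, lr=0.05, seed=0):
    """Train a tiny linear autoencoder with full-batch gradient descent.

    cells: list of equal-length expression vectors (one per cell).
    Returns (encoder weights, decoder weights, mean squared error per epoch).
    """
    rng = random.Random(seed)
    n_genes = len(cells[0])
    n_cells = len(cells)
    enc = [[rng.uniform(-0.5, 0.5) for _ in range(n_genes)] for _ in range(latent_dim)]
    dec = [[rng.uniform(-0.5, 0.5) for _ in range(latent_dim)] for _ in range(n_genes)]
    losses = []
    for _ in range(epochs):
        g_enc = [[0.0] * n_genes for _ in range(latent_dim)]
        g_dec = [[0.0] * latent_dim for _ in range(n_genes)]
        total = 0.0
        for x in cells:
            z = [sum(w * v for w, v in zip(row, x)) for row in enc]      # encode
            x_hat = [sum(w * v for w, v in zip(row, z)) for row in dec]  # decode
            err = [xh - xv for xh, xv in zip(x_hat, x)]
            total += sum(e * e for e in err)
            # Backpropagate the squared error through the decoder, then the encoder.
            dz = [sum(2 * err[g] * dec[g][k] for g in range(n_genes))
                  for k in range(latent_dim)]
            for g in range(n_genes):
                for k in range(latent_dim):
                    g_dec[g][k] += 2 * err[g] * z[k]
            for k in range(latent_dim):
                for g in range(n_genes):
                    g_enc[k][g] += dz[k] * x[g]
        losses.append(total / n_cells)  # loss measured before this epoch's update
        for g in range(n_genes):
            for k in range(latent_dim):
                dec[g][k] -= lr * g_dec[g][k] / n_cells
        for k in range(latent_dim):
            for g in range(n_genes):
                enc[k][g] -= lr * g_enc[k][g] / n_cells
    return enc, dec, losses


def embed(cells, enc):
    """Project each cell into the latent space; clustering would run on this."""
    return [[sum(w * v for w, v in zip(row, x)) for row in enc] for x in cells]


# Toy data (assumption): two hypothetical cell types with distinct profiles.
_rng = random.Random(1)


def noisy(profile):
    return [v + _rng.gauss(0, 0.05) for v in profile]


cells = [noisy([1, 1, 0, 0]) for _ in range(20)] + [noisy([0, 0, 1, 1]) for _ in range(20)]
enc, dec, losses = train_linear_autoencoder(cells)
embedding = embed(cells, enc)
```

Only the encoder output (`embedding`) is carried forward; the decoder exists solely to supply a reconstruction loss during training, which is the distinction between embedding and denoising uses noted above.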

Cluster analysis identified three non-cardiomyocyte cell-types in mouse hearts. Cluster analysis of scRNAseq data from mouse hearts was conducted via (A) AI Autoencoder, (B) Seurat, SC3, RaceID, and CIDR (using the same UMAP), (C) ScanPY, (D) scDHA, (E) ssCCEES, or (F) DCA; then, (Columns i-iv) expression of (i) the fibroblast marker Col1a1, (ii) the endothelial-cell marker Pecam1, (iii) the immune-cell marker Ifi30, and (iv) the smooth muscle cell marker Tagln was quantified across the corresponding UMAP and presented as a heat map.

Figure 4
Each cell-type specific cluster contained cells from all injury groups and time points. Cluster analysis of scRNAseq data from mouse hearts was conducted via (A) AI Autoencoder, (B) Seurat, SC3, RaceID, and CIDR, or (C) ScanPY and presented as a UMAP; then, (Rows) cardiomyocytes in hearts collected from each injury group and at each time point were displayed in red.

Ventricular and atrial cardiomyocyte clusters were identified in the human cell atlas scRNAseq data using AI-based Autoencoder, Seurat, and ScanPY. The subfigures are organized in a grid structure. Each column (A-C) corresponds to a pipeline. Each row shows the expression of one gene, in order: cardiomyocyte (overall) markers TNNT2 (i), TTN (ii), RYR2 (iii), ventricular cardiomyocyte markers MYH7 (iv), MYL2 (v), IRX3 (vi), and atrial cardiomyocyte marker HAMP (vii). The 2D landscape in this figure is identical to the corresponding landscape in Figure 6.

Endothelial cells, pericytes, and smooth muscle cells were identified in the human cell atlas scRNAseq data using AI-based Autoencoder, Seurat, and ScanPY. The subfigures are organized in a grid structure. Each column (A-C) corresponds to a pipeline. Each row shows the expression of one gene, in order: endothelial cell markers CDH5 (i), PECAM1 (ii), VWF (iii), pericyte markers ABCC9 (iv), KCNJ8 (v), and smooth muscle cell markers TAGLN (vi) and ACTA2 (vii). The 2D landscape in this figure is identical to the corresponding landscape in Figure 6.

Monocyte-macrophages and lymphocytes were identified in the human cell atlas scRNAseq data using AI-based Autoencoder, Seurat, and ScanPY. The subfigures are organized in a grid structure. Each column (A-C) corresponds to a pipeline. Each row shows the expression of one gene, in order: monocyte-macrophage markers CD163 (i), LYVE1 (ii), lymphocyte markers CD3E (iii), CD3G (iv), and CD8A (v). The 2D landscape in this figure is identical to the corresponding landscape in Figure 6.

Fibroblasts and glial cells were identified in the human cell atlas scRNAseq data using AI-based Autoencoder, Seurat, and ScanPY. The subfigures are organized in a grid structure. Each column (A-C) corresponds to a pipeline. Each row shows the expression of one gene, in order: fibroblast markers DCN (i), GSN (ii), PDGFRA (iii), glial cell markers NRXN1 (iv), NRXN3 (v), and KCNMB4 (vi). The 2D landscape in this figure is identical to the corresponding landscape in Figure 6.

Supplemental Figure 9
Using the AI-based Autoencoder embedding, different clustering algorithms yield clusters with consistent cell types. After being embedded into just 10 dimensions by the Autoencoder, the pig scRNAseq data were clustered by three clustering algorithms; the clustering results were then visualized using the same UMAP coordinates and the same cell-type identification method as in ref. 37. A) K-means clustering result. B) Louvain clustering result. C) Density-based clustering result.
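One common way to put a number on how consistent two clusterings of the same cells are is the adjusted Rand index (ARI). The sketch below is an illustration only and is not stated to be the comparison used for this figure; the label vectors are toy values, and the function name is an assumption.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two partitions of the same cells.

    1.0 means identical partitions (label names may differ);
    values near 0 mean agreement no better than chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must describe the same cells")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: nothing to disagree about
    # Count cell pairs that share a cluster in both, in A only, and in B only.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = in_a * in_b / total_pairs
    maximum = (in_a + in_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both partitions
    return (both - expected) / (maximum - expected)


# Toy labels (assumption): six cells clustered by three hypothetical methods.
kmeans_labels = [0, 0, 1, 1, 2, 2]
louvain_labels = ["b", "b", "a", "a", "c", "c"]  # same partition, renamed
density_labels = [0, 0, 1, 1, 2, 1]              # one cell assigned differently

ari_same = adjusted_rand_index(kmeans_labels, louvain_labels)
ari_partial = adjusted_rand_index(kmeans_labels, density_labels)
```

Because the index depends only on which cells are grouped together, it is unaffected by the arbitrary cluster numbering that each algorithm produces.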

Supplemental Figure 10
Analysis of pathways and biological processes upregulated in CCND2 hiPSC-IR P28-P35 cardiomyocytes, compared with MI P28-P35 cardiomyocytes. Methods that do not calculate enrichment for each cardiomyocyte include: A) Gene ontology results, produced by DAVID (https://david.ncifcrf.gov) from the Seurat-Ranksum result. B) Gene ontology results, produced by DAVID (https://david.ncifcrf.gov) from the Seurat-MAST result. Methods that calculate enrichment for each cardiomyocyte include the AI-based sparse model and ssGSEA; therefore, violin plots, which summarize the enrichment scores for all cells, were examined among all groups. From C) to L), the left subpanel is the sparse model result, and the right subpanel is the ssGSEA result. C) Cell cycle G1 to DNA synthesis phase (G1S). D) Cell cycle DNA synthesis phase (S). E) Cell cycle G2 to Mitosis checkpoint phase (G2M). F) Cytokinesis phase. G) MAPK signaling. H) HIPPO signaling. I) cAMP signaling. J) JAK-STAT signaling. K) RAS signaling. L) TGFβ signaling.
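The distinction drawn in this caption, between methods that return one result per comparison and methods that return one enrichment score per cell, can be illustrated with a deliberately simplified rank-based score. The sketch below is neither the sparse model nor ssGSEA; the scoring formula, gene names, and group names are all assumptions chosen for illustration.

```python
from statistics import median


def gene_set_score(expression, gene_set):
    """Simplified per-cell score: mean rank of the set's genes, centred and
    scaled so that 0 means 'no different from an average gene'.

    expression: dict mapping gene name -> expression value for ONE cell.
    Ties are broken by dictionary order, which is adequate for a sketch.
    """
    ordered = sorted(expression, key=expression.get)  # lowest expression first
    rank = {gene: i + 1 for i, gene in enumerate(ordered)}
    members = [gene for gene in gene_set if gene in rank]
    if not members:
        return 0.0
    n_genes = len(ordered)
    mean_rank = sum(rank[gene] for gene in members) / len(members)
    return (mean_rank - (n_genes + 1) / 2) / n_genes


# Toy data (assumption): two groups of two cells, four placeholder genes.
cells_by_group = {
    "group_1": [
        {"g1": 5, "g2": 4, "g3": 1, "g4": 0},
        {"g1": 6, "g2": 3, "g3": 2, "g4": 1},
    ],
    "group_2": [
        {"g1": 0, "g2": 1, "g3": 4, "g4": 5},
        {"g1": 1, "g2": 0, "g3": 5, "g4": 3},
    ],
}
pathway = {"g1", "g2"}  # hypothetical gene set

# One score per cell; a violin plot would show these distributions per group.
scores_by_group = {
    group: [gene_set_score(cell, pathway) for cell in cells]
    for group, cells in cells_by_group.items()
}
group_medians = {group: median(scores) for group, scores in scores_by_group.items()}
```

Having a score for every cell is what makes a per-group distribution plot possible; a single list of enriched terms per comparison, as in panels A and B, cannot be summarized that way.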

Supplemental Figure 11
Separation between human and pig cells in scRNAseq data. Pig cells: 6095 pig heart cells selected from our lab's data 18. Human cells: 4737 cells selected from iPSC data 19. A) The % of SC transcripts mapped to the human (GRCh38) and pig (Sscrofa10.2) genomes, produced by the Cell Ranger pipeline; this result shows that the current 'published' human and 'draft' pig genomes are sufficient to distinguish between human and pig cells. B) UMAP plot of the combined 'human+pig cells on human+pig genome' SC data following Cell Ranger. C) Histograms of the ratio between the number of pig-genome-mapped transcripts and the number of human-genome-mapped transcripts in each cell; the cyan histogram represents the human cells, and the red histogram represents the pig cells. D) UMAP plot of the cells, colored by the ratio shown in panel C.
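The per-cell ratio described in panel C can be turned into a species call with a simple threshold. The following is a minimal sketch under stated assumptions: the log2 transform, the pseudocount, the cutoff of a two-fold difference, and the toy counts are all illustrative choices and are not taken from the study.

```python
import math


def log2_pig_to_human_ratio(pig_count, human_count, pseudocount=1.0):
    """log2 of (pig-mapped transcripts / human-mapped transcripts) for one cell.

    The pseudocount (assumption) avoids division by zero for cells with no
    transcripts mapped to one of the genomes.
    """
    return math.log2((pig_count + pseudocount) / (human_count + pseudocount))


def assign_species(pig_count, human_count, min_abs_log2=1.0):
    """Call a cell 'pig' or 'human' only when one genome dominates by at least
    2**min_abs_log2-fold (assumed cutoff); otherwise report 'ambiguous'."""
    ratio = log2_pig_to_human_ratio(pig_count, human_count)
    if ratio >= min_abs_log2:
        return "pig"
    if ratio <= -min_abs_log2:
        return "human"
    return "ambiguous"


# Toy counts (assumption): (pig-mapped, human-mapped) transcripts per cell.
toy_cells = {
    "cell_1": (900, 30),
    "cell_2": (25, 1200),
    "cell_3": (500, 480),
}
calls = {name: assign_species(pig, human) for name, (pig, human) in toy_cells.items()}
```

An explicit "ambiguous" class is useful here because cells with comparable counts from both genomes may be doublets or carry ambient RNA, and forcing them into one species would blur the two histograms.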